Simple estimation of absolute free energies for biomolecules 
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One reason that free energy difference calculations are notoriously difficult in molecular systems 
is due to insufficient conformational overlap, or similarity, between the two states or systems of 
interest. The degree of overlap is irrelevant, however, if the absolute free energy of each state 
can be computed. We present a method for calculating the absolute free energy that employs a 
simple construction of an exactly computable reference system which possesses high overlap with 
the state of interest. The approach requires only a physical ensemble of conformations generated via 
simulation, and an auxiliary calculation of approximately equal central-processing-unit (CPU) cost. 
Moreover, the calculations can converge to the correct free energy value even when the physical 
ensemble is incomplete or improperly distributed. As a "proof of principle," we use the approach to 
correctly predict free energies for test systems where the absolute values can be calculated exactly, 
and also to predict the conformational equilibrium for leucine dipeptide in implicit solvent. 

PACS numbers: pacs 
Keywords: free energy,entropy 



I. INTRODUCTION 

Knowledge of the free energy for two different states or 
systems of interest allows the calculation of solubilitiesji*^ 
determines binding affinities of ligands to proteins*-^ 
and determines conformational equilibria (e.g., Ref. |5|). 
Free energy differences (AF) therefore have potential ap- 
plication in structure-based drug design where current 
methods rely on ad hoc protocols to estimate binding 
affinities 

Poor "overlap," the lack of configurational similarity 
between the two states or systems of interest, is a key 
cause of computational expense and error in AF calcu- 
lations. The most common approach to improve overlap 
in free energy calculations (used in thermodynamic inte- 
gration, and free energy perturbation) is to simulate the 
system at multiple hybrid, or intermediate stages (e.g., 
Illlll2f) . However, the simulation of interme- 
diate stages greatly increases the computational cost of 
the AF calculation. 

Here, we address the overlap problem by calculating 
the absolute free energy for each of the end states, thus 
avoiding the need for any configurational overlap. Our 
method relies on the calculation of the free energy dif- 
ference between a reference system (where the exact free 
energy can be calculated, either analytically or numeri- 
cally) and the system of interest. 

Such use of a reference system with a computable 
free energy has been used successfully in solids where 
the reference system is generally a harmonic or Einstein 
solidfl^JA and liquid systems, where the reference system 
is usually an ideal gasii£*i& The scheme has also been ap- 
plied to molecular systems by Stoessel and Nowak, using 
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a harmonic solid in Cartesian coordinates as a reference 
systemiii 

Other approaches to calculate the absolute free ener- 
gies of molecules have been developed. Meirovitch and 
collaborators calculated absolute free energies for pep- 
tides in vacuum, for liquid argon and water using the 
hypothetical scanning methodpiS^ Computational cost 
has thus far limited the approach to peptides with sixty 
degrees of freedom^ The "mining minima" approach, 
developed by Gilson and collaborators, estimates the ab- 
solute free energy of complex molecules by attempting 
to enumerate the low-energy conformations and estimat- 
ing the contribution to the configurational integral for 
eachi2ii2£ Anharmonic effects can be included^ The 
mining minima method can, in principle, include po- 
tential correlations between the torsions and bond an- 
gles or lengths, and uses an approximate method to 
compute local partition functions. Other investigators 
have estimated absolute free energies for molecules using 
harmonic or quasi-harmonic approximationSfS&SiSli how- 
ever, as discussed in Refs. l23l andl24l local minima can be 
deviate substantially from a parabolic shape. 

We introduce, apparently for the first time, a refer- 
ence system which is constructed to have high overlap 
with fairly general molecular systems. The approach can 
make use of either internal or Cartesian coordinates. For 
biomolecules, using internal coordinates greatly enhances 
the accuracy of the method since internal coordinates are 
tailored to the description of conformations. Further, all 
degrees of freedom and their correlations are explicitly 
included in the method. 

Our method differs in several ways from the impor- 
tant study of Stoessel and Nowaktf (i) we use internal 
coordinates for molecules which are key for optimizing 
the overlap between the reference system and the sys- 
tem of interest; (ii) we may use a nearly arbitrary ref- 
erence potential because only a numerical reference free 
energy value is needed, not an analytic value; (iii) there 
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is no need, in cases we have studied, to use multi-stage 
methodology to find the desired free energy due to the 
overlap built into the reference system, 

We consider this report a "proof of principle" for our 
reference system method. After introducing the method, 
it is tested on single and double- well two-dimensional sys- 
tems, and on a methane molecule where absolute free 
energy estimates can be compared to exact values. The 
method is then used to compute the absolute free energy 
of the alpha and beta conformations for leucine dipep- 
tide (ACE-(lcu) 2 -NME) in implicit solvent, using all one- 
hundred fifteen degrees of freedom, correctly calculating 
the free energy difference AF a i p ha-»bota- Extensions of 
the method to larger systems are then discussed. 



II. REFERENCE SYSTEM METHOD 
A. The fundamental relations 

The absolute free energy of the system of interest 
( "phys" for physical) is defined using the partition func- 
tion Zphys 

-Fphys = ™fc,BT 111 Zphys = 

dx e -l 3 { U p^y^ +K p^ s >l 



Ignoring the constant is not a limitation since, for the 
conformational free energies studied here, the term can- 
cels for free energy differences. 



B. The reference energy and its normalization 

The trivial identities of Eq. © suggest that arbitrary 
reference systems can be used in our approach. To be 
concrete and anticipate the procedure used, our discus- 
sion below will assume that a finite-length simulation of 
the system of interest has been performed — from which 
histograms of the coordinates have been generated. For 
the molecular systems studied in this report, ordinary 
Langevin dynamics simulations are performed using stan- 
dard forcefields. The reference potential energy can be 
constructed from a wide variety of histograms, as dis- 
cussed below. Denoting the computed histograms over 
all coordinates as P(x), we define 
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where T is the system temperature, (3 = l/fc^T, /7 p h ys 
and -Kphys are, respectively, the physical potential energy 
(i.e., simulation forcefield) and the kinetic energy, and 
x represents the full set of configurational coordinates 
(internal or Cartesian). The kinetic energy term can be 
integrated exactly to obtain^ 
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where rrii is the mass of atom i, h is Planck's constant, 
C° is the standard concentration, a is the symmetry 
number, 22 N is the number of particles in the system, 
and the integral is defined to be the configurational parti- 
tion function. For method used in this study the absolute 
free energy of the system of interest is calculated using a 
reference system ("ref"), and the following relationships 
are used, 
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where -Frof is the trivially computable free energy of the 
reference system, and AF rc f^ p h V s is the free energy dif- 
ference between the reference and physical system which 
can be calculated using standard techniques. 

For this report, we include estimates of the configu- 
rational integral only, i.e., the leading constant factor in 
square brackets in Eq. (J2J) is not included in our results. 



U xei {x) = ~k B T\nP(x) 
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where P(x) is the normalized probability of a particular 
configuration (corresponding to a set of histogram bins); 
see Fig. ^ For example, if all coordinates are binned as 
independent, then 



p(£)= n p ^)' 



(5) 



where Pi(xi) is the binned probability distribution (his- 
togram) for the i th coordinate, and there are N com & s de- 
grees of freedom in the system. If all coordinates are 
binned as pairwise correlated, then 
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P{x) = Y[ Pi](Xi,Xj), 



(6) 



where 



is a set of pairs in which each coordi- 
nate occurs exactly once, and Pij(xi,Xj) is the proba- 
bility for two particular coordinate values from the two- 
dimensional histogram for these coordinates. It is also 
possible to use an arbitrary combination of independent 
and correlated coordinates — so long as each coordinate 
occurs in only one P factor. 

We emphasize that the final computed free energy val- 
ues include all correlations embodied in the true potential 
C^phys- This is true regardless of whether or how coordi- 
nates are correlated in the reference potential. 

A schematic of how U re f is computed for a one- 
coordinate system is shown in Fig. ^ The coordinate 
histogram is first determined (solid bars) using a simu- 
lation trajectory; then Eq. Q is used to calculate U le { 
(dashed bars). A possible physical potential is also in- 
cluded (dotted line) for comparison to U re { . For a system 
containing many degrees of freedom, the process is car- 
ried out for all coordinates, based on Eq. (JSJl, © or other 
correlation scheme. U ie { is the sum of all the appropriate 
terms, consistent with Eq. Q and the binning choice. 
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FIG. 1: Depiction of how the reference potential energy U le { 
is calculated for a one-coordinate system. First the coordi- 
nate is binned, creating a histogram P (solid bars) populated 
according to the physical ensemble. Then Eq. Q is used to 
calculate reference energies for each coordinate bin (dashed 
bars) . A hypothetical physical potential is shown as a dotted 
curve for comparison to U ra f. For a multi-coordinate system 
U Ic f would be the sum of the single-coordinate reference po- 
tential energies. 



perturbation^ from the reference to the physical systems 




where iV rc f is number of structures in the reference en- 
semble, the "=" symbol denotes a computational esti- 
mate, and {...) re f represents a canonical average using 
structures from the reference ensemble only. It is impor- 
tant to note that, while other choices for computing -Fphys 
are possible, such as Bennett's methodj 5 i 27 i 28 i 29 i 30 i 31 Eq. 
HI U| l is the only choice which relies solely on configura- 
tions drawn from the reference ensemble which are, by 
construction, sampled canonically and without dynam- 
ical trapping. We also note that "uni-directional" esti- 
mates like that of Eq. (| 1 UB have been analyzed exten- 
sively (e.g., Refs. 112 and l33^> and may be amenable to 
error-reduction techniques j 3 ^*^ however, we have applied 
the perturbation approach here to keep our initial anal- 
ysis as straightforward as possible. Staged free energy 
methods like thermodynamic integration 3 ^ and adaptive 
integration^ may also be used. 



The free energy of the reference system can now be 
calculated via the reference partition function 

Z Tei = J dx e-^—W = J dx P{x). (7) 

In practice, we normalize the histogram for each coordi- 
nate to one independently by summing over all histogram 
bins. So, for a particular bond length ri, that is binned 
as independent, we account for the Jacobian factor (see 
Eq. (nj) by defining £ = rf/3, and then 

Zi = f d£ P(£) = J2 ^ p (0 = 1. (8) 

J Jv bin 

where A£ is the histogram bin size, and iVbin is the num- 
ber of bins in the r\ histogram. (Binning choices are 
discussed below.) Similar relationships are used for all 
coordinates. Thus the reference free energy F le { = and 
Eq. becomes 

Fphys — A-F r ef->phyS (F Ie { = 0) (9) 



C. Using the physical and reference ensembles 

With the reference potential energy U Ic i defined in 
Eq. (@J and the physical potential energy U p h ys given by 
the forcefield, which may include implicit solvation en- 
ergies, Boltzmann-distributed snapshots from both the 
reference and physical systems can be utilized to calcu- 
late Fph ys =Ai 7 ' ro f^phys. Here, we simply use free energy 



D. The physical ensemble and construction of the 
reference system 

The method used in this report relies on simple his- 
tograms for all degrees of freedom (in principle, with 
internal or Cartesian coordinates) based on a "physical 
ensemble" of conformations generated via molecular dy- 
namics, Monte Carlo or other canonical simulation. The 
histograms define a reference system with a free energy 
that is trivially computable, as described in Sec. [H] We 
emphasize that an analytical solution need not be avail- 
able; a precise numerical evaluation is more than ade- 
quate. A well-sampled ensemble of reference system con- 
figurations is then readily generated and used to compute 
the free energy difference via Eq. (|10|) . 

The first step in our approach to constructing the ref- 
erence system is to generate a physical ensemble (i.e., 
a trajectory) by simulating the system of interest us- 
ing standard molecular dynamics, Monte Carlo, or other 
canonical sampling techniques. The trajectory produced 
by the simulation is used to generate histograms for all 
coordinates as described below. In creating histograms, 
note that constrained coordinates, such as bond lengths 
involving hydrogens constrained by RATTLE, 38 need not 
be binned since these coordinates do not change between 
configurations. Such coordinate constraints are not re- 
quired in the method, however. 

If internal coordinates are used (such as for the 
molecules in this study), care must be taken to account 
for the Jacobian factors. Using internal coordinates with 
bond lengths r, bond angles 8 and dihedrals u, the vol- 
ume element in the configurational integral of Eq. J5J is 
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given by^ 3 . 

AT-1 AT-2 AT-3 

AT-l JV-2 JV-3 

n d ^ 3 /3) n ^-cos^) n ( n ) 

2=1 2=1 2=1 

where A~ is the number of atoms in the system. Thus, 
when using internal coordinates, the simplest strategy to 
account for the Jacobian is to bin according to a set of 
rules: bond lengths are binned according to r 3 /3, bond 
angles are binned according to cos#, and dihedrals are 
binned according to lu (i.e., the same as Cartesian coor- 
dinates). 

E. Generation of the reference ensemble 

Once the histograms are constructed and populated 
using the physical ensemble, the reference ensemble is 
generated. To generate a single reference structure, for 
each coordinate one chooses a histogram bin according 
to the probability associated with that bin. Then a co- 
ordinate value is chosen at random uniformly within the 
bin according the Jacobian factor in Eq. Ijlljl — e.g., for 
a bond length r, one chooses uniformly in the variable 
(r 3 /3). The process is repeated for every degree of free- 
dom in the system. By repeating the entire procedure, 
one can generate as many reference structures as desired 
(i.e., the reference ensemble). 

F. Summary of the reference system method 

In summary, the method is implemented by first con- 
structing properly normalized histograms for all internal 
(or Cartesian) coordinates based on a physical ensem- 
ble of structures. An ensemble of reference structures is 
then chosen at random from the histograms. The refer- 
ence energy (U Te f of Eq. 10}) and physical energy (U p h ys 
from the forcefield) must be calculated for each structure 
in the reference ensemble. Finally, Eq. H1U|) is used to 
calculate the desired absolute free energy of the system 
of interest. 

The CPU cost of the method, above that of the initial 
"physical" trajectory, is one physical energy evaluation 
for each of the N Ic f reference structures, plus the less 
expensive cost of generating reference structures. 

III. RESULTS 

To test the effectiveness of the reference system method 
we first estimated the absolute free energy for three test 
systems where the free energy is known exactly. Wc 
chose the two-dimensional potentials from Ref. l39i and 
a methane molecule in vacuum. Finally, we used the 



System 


Exact 


Estimate 


two-dimensional single-well— 


-1.1443 


-1.1449 (0.0003) 


two-dimensional double-well— 


5.4043 


5.4058 (0.0003) 


Methane molecule 


10.932 


10.934 (0.002) 



TABLE I: Absolute free energy estimates obtained using our 
reference system approach for cases where the absolute free 
energy can be determined exactly. In all cases, the estimate is 
in excellent agreement with the exact free energy. The uncer- 
tainty, shown in parentheses (e.g., 3.14 (0.05) = 3.14 ±0.05), 
is the standard deviation from five independent simulations. 
The results for the two-dimensional systems are in ksT units 
and methane results have units of kcal/mole. The table shows 
estimates of the configurational integral in Eq. i.e., the 
constant term is not included in the estimate. 

method to estimate the absolute free energies of the alpha 
and beta conformations of the 50-atom leucine dipeptide 
(ACE-(leu)2-NME), and compared the free energy differ- 
ence obtained via our method with an independent esti- 
mate. In all cases, the free energy estimate computed by 
our approach is in excellent agreement with independent 
results. 



A. Simple test systems 

We first studied the two-dimensional single and double- 
well potentials from Ref. |3j| 

< y S s°(^y) = (^ + 2) 2 + y 2 , 

C/ph y u s blc (^2/) = ^{((^-l) 2 -2/ 2 ) 2 + 

10(x 2 -5) 2 + (x + y) 4 + (x-y) 4 }. (12) 

Table J] shows the excellent agreement between the ref- 
erence system estimates and the exact free energies (ob- 
tained analytically) for the two-dimensional potentials 
used in this study, Eq. I|12[l . The "physical" simulations 
used Metropolis Monte Carlo with fcsT = 1.0 and one 
million snapshots in the physical and reference ensembles. 
For all two-dimensional simulations, both coordinates 
were treated with full correlations — i.e., two-dimensional 
histograms were used — and the bin sizes were chosen such 
that the number of bins ranged from 100-1000. The error 
shown in Table [I] in parentheses is the standard devia- 
tion from five independent estimates using five separate 
physical ensembles — and thus five different reference sys- 
tems. Good estimates were also obtained using fewer 
snapshots — e.g., we obtained F = —1.142 (0.003) for 
the single-well potential and F = 5.408 (0.007) for the 
double-well potential using 10,000 snapshots in both the 
physical and reference ensembles. 

Table H] also shows the excellent agreement between 
the reference system estimates and the exact value of the 
free energy for methane in vacuum. Methane trajectories 
were generated using TINKER 4.2i& with the OPLS-AA 
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FIG. 2: Absolute free energy for methane estimated by the 
reference system method as a function of the number of refer- 
ence structures iV re f used in the estimate. The solid horizontal 
line is the exact free energy obtained by numerical integra- 
tion. Five independent simulations are shown on a log scale 
to clearly show the convergence of the free energy estimate. 
Results shown were obtained using Eq. il l Ot with one-hundred 
bins for each degree of freedom, i.e., the estimates for the ab- 
solute free energy of methane in Tableware the values shown 
here for N let = 1, 000, 000. 

forcefield4i The temperature was maintained at 300.0 
K using Langevin dynamics with a friction coefficient of 
91.0 ps" 1 and a time step of 0.5 fs. The physical ensemble 
was created by generating five 10.0 ns trajectories with 
snapshots saved every 0.1 ps. Using the 100,000 methane 
structures in the physical ensemble, the reference system 
was generated by binning internal coordinates into his- 
tograms. The absolute free energy was then estimated by 
generating 100,000 structures for the reference ensemble 
and using Eq. (f 1 C)|l . All coordinates were binned as in- 
dependent using one-hundred bins per coordinate, thus 
only one-dimensional histograms were required. The un- 
certainty shown in parenthesis in Table [I] is the stan- 
dard deviation from five independent estimates using the 
five separate methane trajectories — and thus five differ- 
ent reference systems. 

Figure [21 shows the convergence behavior of the ref- 
erence system method for methane. Five independent 
absolute free energy estimates are shown as a function of 
the number of reference structures used in the estimate. 
Each of the five simulations use the same protocol as de- 
scribed above, i.e., the absolute free energy estimates in 
Table|Hare the values shown in Fig.EJfor N rc{ = 100, 000. 

Methane was chosen as a test system because intra- 
molecular interactions are due only to bond lengths and 
angles. In the OPLS-AA forcefield no non-bonded terms 
are present in the potential energy U p h ys , and thus the 
exact absolute free energy can be computed numerically 
without great difficulty. For methane, a configuration is 
determined by: (i) four bond lengths, which are inde- 
pendent of each other and all of other coordinates in the 



forcefield; and (ii) five bond angles which are correlated 
to one another but not to the bond lengths. Thus the 
exact partition function Z me th is a product of four bond 
length partition functions Z r and one angular partition 
function Z$, 

Zmeth = Z*Zg, 

/>oo 

Z r = / dr e-^^WM, 
Jo 

z e = [ dMM<M<M05 e - pu ^ {eiM3M5) . (13) 

Jo 

^phys(?") is harmonic and thus Z r was computed an- 
alytically using parameters from the forcefield. For 
^phys(#i, 02, #3, #4, #5) the correlations between angles 
must be taken into account, thus Zg was estimated 
numerically using TINKER to evaluate U p h ys in the 
five-dimensional integral. We found that F mct h = 
— ksTln Z mct h — 10.932 kcal/mol as shown in Tabled 

Methane was also used to show that the method cor- 
rectly computes the free energy even when the physical 
ensemble is incorrect or incomplete. In our studies we 
found that the correct free energy is obtained using our 
method even when the histogram for each coordinate was 
assumed to be flat, i.e., without the use of a physical en- 
semble (data not shown). 

Choosing the size of the histogram bins is an impor- 
tant consideration. Figurc|3shows the large "sweet spot" 
where bins are large enough to be well populated, and yet 
small enough to reveal histogram features. The figure 
shows results for the absolute free energy for a methane 
molecule using ten-thousand structures in both the phys- 
ical and reference ensembles, iV p h ys = N xe f — 10, 000, 
(dashed curve) and N p h ys = N le { = 100, 000 (solid 
curve). The small vertical scale of two kcal/mol and 
the logarithmic horizontal scale emphasize that there is a 
wide range of bin sizes that produce excellent results for 
the reference system approach. Error bars are the stan- 
dard deviation of five independent simulations. The solid 
horizontal line shows the exact free energy and the curves 
are free energy estimates, using Eq. (|10|) as a function of 
the number of bins used for the histograms for all degrees 
of freedom. From this plot it is clear that one should 
choose at least fifty bins, and that the maximum number 
of bins that should be used depends on the number of 
snapshots in the physical ensemble — more snapshots in 
the physical ensemble means one can use more bins for 
the reference system. 

B. Leucine dipeptide 

Table [n] shows the agreement for leucine dipep- 
tide (ACE-(leu)2-NME) between the free energy differ- 
ence AF a ipha^bcta as predicted by the reference sys- 
tem method, and as predicted via long simulation. The 
leucine dipeptide physical ensembles were generated us- 
ing TINKER 4.2ia with the OPLS-AA forcefieldii The 
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FIG. 3: Absolute free energy for methane estimated by the 
reference system method as a function of the number of his- 
togram bins used for each degree of freedom. The plot shows 
the "sweet spot" where histogram bins are small enough to 
reveal histogram features, yet large enough to give sufficient 
population in each bin. The results are shown with a vertical 
scale of two kcal/mol and on a log scale to emphasize the wide 
range of bin sizes that produce excellent results for the ref- 
erence system approach. Results shown were obtained using 
Eq. 11011 for a methane molecule using JV p hys = N ie i = 10, 000 
(dashed curve) and iVphys = -?Vref = 100,000 (solid curve). 
The solid horizontal line shows the exact free energy and the 
errorbars are the standard deviations of five independent tri- 
als. The plot demonstrates at least fifty bins should be used 
for each independent coordinate, and that the maximum num- 
ber of bins depends on the number of snapshots in the physical 
ensemble. 



System 


Estimate (kcal/mol) 


Independent Estimate 


-Falpha 


87.3 (0.7) 




-fbcta 


86.3 (0.7) 




A.Z*alpha — >bcta 


-1.0 (0.9) 


-0.85 (0.05) 



TABLE II: Absolute free energy estimates of the alpha 
(-Faipha) and beta (-Fbeta) conformations obtained using the 
reference system method for leucine dipeptide with GBSA 
solvation, in units of kcal/mol. The independent measure- 
ment for the free energy difference was obtained via a 1.0 /is 
unconstrained simulation. The uncertainty for the absolute 
free energies, shown in parentheses, is the standard devia- 
tion from five independent 10.0 ns leucine dipeptide simula- 
tions using one-million reference structures in the reference 
ensemble. The uncertainty for the free energy differences is 
obtained by using every possible combination of fkipha and 
-Fbeta, i-e., twenty-five independent estimates. The standard 
error associated with the AF a i p ha-»bota reference system esti- 
mate is 0.18 kcal/mol, reflecting the twenty-five independent 
estimates. The table shows estimates of the configurational 
integral in Eq. i.e., the constant term is not included in 
the estimate. 



temperature was maintained at 500.0 K (to enable an in- 
dependent AF estimate via repeated crossing of the free 
energy barrier between alpha and beta configurations), 
using Langevin dynamics with a friction coefficient of 5.0 
ps _1 . GBSA 42 implicit solvation was used, and RATTLE 
was utilized to maintain all bonds involving hydrogens at 
their ideal lengths^ allowing the use of a 2.0 fs time step. 

We calculated reference systems and computed ab- 
solute free energies of the alpha and beta conforma- 
tions based on five 10.0 ns trajectories. For all simu- 
lations, backbone torsions were constrained using a flat- 
bottomed harmonic restraint (zero force if the torsion 
angles were within the allowed range, and harmonic oth- 
erwise), namely, for alpha: —105 < cj) < —45 and — 70 < 
rp < -10; and for beta: -125 < (j> < -65 and 120 < tp < 
180. The reference system was generated using 100,000 
snapshots from the physical ensemble, then free energy 
estimates were obtained by generating 1,000,000 struc- 
tures for the reference ensemble for each estimate. All 
one-hundred fifteen (excludes bond lengths constrained 
by RATTLE^) internal coordinates were binned as in- 
dependent with fifty bins for each coordinate. The un- 
certainty shown in parenthesis is the standard deviation 
from the five independent estimates using the five sep- 
arate trajectories, i.e., five different physical ensembles 
and five different reference systems. 

Since independent estimates of the absolute free en- 
ergies of the alpha and beta conformations of leucine 
dipeptide are not available, we calculated the free energy 
difference AF a i p ha->beta = —0.85 (0.05) kcal/mol via a 
1.0 /is unconstrained simulation. The uncertainty of the 
independent estimate was obtained using block averages. 
The temperature was chosen to be 500.0 K which allowed 
around 1500 crossings of the free energy barrier between 
the alpha and beta conformations, providing an accurate 
independent estimate. As can be seen in Table^J our es- 
timated free energy difference is in good agreement with 
the independent value obtained via long simulation. 

We emphasize that the nearly kcal/mol fluctuations 
observed in our leucine dipeptide estimates are com- 
pletely independent of the magnitude of the free energy 
difference of the same order. That is, for a similar sized 
system and similar CPU investment, one would expect 
similar uncertainty, even for a very large free energy dif- 
ference. This, indeed, is the motivation for performing 
absolute free energy calculations. We believe, moreover, 
that efficiency improvements will be achieved beyond the 
data in this initial report. 

Figure 0] shows the convergence behavior of the ref- 
erence system method for leucine dipeptide. Five free 
energy estimates are shown as a function of the number 
of reference structures used in the estimate for (a) the al- 
pha configuration, and (b) the beta configuration. Each 
of the five simulations use the same protocol as described 
above. 

The leucine dipeptide calculations also demonstrate 
two important aspects of the particular reference system 
defined in this study: (i) the reference system has good 
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FIG. 4: Free energy for leucine dipeptide estimated by the 
reference system method as a function of the number of refer- 
ence structures iVref used in the estimate. Five independent 
simulations are shown on a log scale to demonstrate the con- 
vergence behavior of the free energy estimate for (a) the alpha 
configuration, and (b) the beta configuration. Results shown 
were obtained using Eq. HOH with fifty bins for each degree 
of freedom. 



overlap with the physical system; and (ii) the reference 
system is broader than the physical system. Figure 
shows a scatter plot of the \2 torsions of each residue for 
both the physical and reference ensembles. Each ensem- 
ble contains 100,000 structures. The figure clearly shows 
the excellent overlap between the reference and physical 
ensemble, as can be seen by the similarity between the 
two plots. In addition, the reference ensemble scatter 
plot has data in the region (-60,-60) which does not ex- 
ist in the physical ensemble, showing that the reference 
system is "broader" than the physical system. 

Figure shows a histogram of the distance between 
the Cg atom of residue one and the C a of residue two 
for the same ensembles as Fig. \5\ The figure again shows 
how the reference system has both excellent overlap with 
the physical system and is also broader than the physical 



Physical ensemble 




X, of residue 2 



Reference ensemble 




X, of residue 2 

FIG. 5: Scatter plots of the two \2 torsions of each residue 
for leucine dipeptide. Results are shown for both physical and 
reference ensembles containing 100,000 structures each. The 
figure shows that: (i) the reference system has good overlap 
with the physical system, as can be seen by the similarity 
between the two plots; and (ii) the reference system is more 
broadly distributed than the physical system, as evidenced 
by the data at (-60,-60) for the reference system that is not 
present for the physical system. 



system. 



IV. DISCUSSION 

The present results raise a number of questions regard- 
ing the reference system approach to computing absolute 
free energies — in particular, regarding the use of correla- 
tions, the importance of the physical ensemble, and the 
potential for application to larger systems. 
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Distance between C g of residue 1 and C a of residue 2 (Angstroms) 

FIG. 6: Histogram of the distance between the C$ of residue 
one and the C a of residue two for leucine dipeptide. Results 
are shown for both reference and physical ensembles contain- 
ing 100,000 structures each. The figure shows that: (i) the 
reference system has good overlap with the physical system; 
and (ii) the reference system is broader than the physical sys- 
tem. 



A. Correlation of Coordinates 

How can correlations among coordinates be used to 
increase the method's effectiveness? One may choose 
to bin coordinates as independent (i.e., one-dimensional 
histograms), or with correlations (i.e., multi-dimensional 
histograms). For example, in peptides, one may choose 
to bin all sets of backbone 0, ip torsions as correlated, 
and all other coordinates (bond lengths, bond angles, 
other torsions) as independent. It might always seem 
advantageous to bin some coordinates (at least backbone 
torsions) as correlated, since reference structures drawn 
randomly from the histograms will be less likely to have 
steric clashes. On the other hand, including correlations 
with small bin sizes is impractical. As an example, imag- 
ine that for the leucine dipeptide molecule used in this 
study, one binned the four 0, -0 backbone torsions as cor- 
related. If fifty bins for each torsion were used (as should 
be done according to the discussion below), then there 
would be 50 4 = 6, 250, 000 multi-dimensional bins to pop- 
ulate, which is simply not feasible. 

There does appear to be an important advantage to 
eliminating at least some correlations from the original 
"physical" ensemble: namely, a larger portion of confor- 
mational space becomes available to the reference ensem- 
ble; see Figs. [5] and H3 Since coordinates for the reference 
structures are drawn randomly and independently, it is 
possible to generate reference structures that are in en- 
tirely different energy basins than those in the physical 
ensemble. It is thus possible to overcome the inadequacies 
of the physical ensemble by binning internal coordinates 
independently. The optimal (presumably) limited use of 
correlations will be considered in future work. 



Regardless of the degree of correlations included in 
U le f, we emphasize that final results fully include cor- 
relations in the physical potential U p h ys - 

B. Quality of the physical ensemble 

Since the reference ensemble is generated by drawing 
at random from histograms which, in turn, were gener- 
ated from the physical ensemble, a natural question to 
ask is: how complete does the physical ensemble need to 
be? The surprising answer is that, for our reference sys- 
tem method, the physical ensemble does not need to be 
complete, or even correct (properly distributed). Since 
Eqs. |JS} and © are valid for arbitrary reference systems, 
the convergence of the free energy estimate to the cor- 
rect value is guaranteed, in the limit of infinite sampling 
(iV re f — > oo), regardless of the quality of the physical en- 
semble. The "trick" is that the ensemble for the reference 
system must be converged, which can be achieved with 
much less expense since there is no dynamical trapping. 
Unlike the typical case for molecular mechanics simu- 
lation, we sample the reference ensemble "perfectly" — 
there is no possibility of being trapped in a local basin. 
By construction, since all coordinate values are generated 
exactly according to the reference distributions, the ref- 
erence ensemble can only suffer from statistical (but not 
systematic) error. For example, it was possible to ob- 
tain the correct free energy for methane based on 10,000 
reference structures even when the histogram for each co- 
ordinate was assumed to be flat, i.e., without the use of 
a physical ensemble (data not shown). 

It is important to note that, while convergence to the 
correct free energy is guaranteed for any choice of refer- 
ence system, the efficiency of the method could be dra- 
matically reduced if the reference system does not overlap 
well with the physical system. 

Given the fact that the physical ensemble need not 
be correct, it is easy to imagine a modified method that 
does not require simulation, but instead populates the 
histogram bins using the "bare" potential for each in- 
ternal coordinate (e.g., Gaussian histograms for bond 
lengths and angles). Of course, the conformational state 
must be defined explicitly, with upper and lower limits 
for coordinates. Allowed ranges for the torsions (espe- 
cially (j), ip) are naturally obtainable via, e.g., Ramachan- 
dran propensities (e.g., Ref . l43|) . and reasonable ranges 
for bond lengths and angles could be chosen to be, e.g., 
several standard deviations from the mean. 



C. Extension to larger systems 

While the initial results of our reference system method 
are promising, a naive implementation of the method will 
find difficulty with large systems (as do all absolute and 
relative free energy methods). For our method, the dif- 
ficulty with including a very large number of degrees of 
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freedom is due to the fact that, if one does not treat 
all correlations in the backbone, then steric clashes will 
occur frequently when generating the reference ensemble. 

However, it is possible to extend the method to larger 
peptides, still include all degrees of freedom, and bin 
all coordinates independently (important for broadening 
configurational space, as discussed above), by using a 
"segmentation" technique motivated by earlier work42^ 
Consider generating reference structures for a ten-residue 
peptide in the alpha helix conformation. Due to the large 
number of backbone torsions, most of the reference struc- 
tures chosen at random will not be energetically favor- 
able. However, if one breaks the peptide into two pieces, 
then one can generate many structures for each segment, 
and only "keep" energetically likely segment structures. 
The selected structures may be joined to form full struc- 
tures which are reasonably likely to have low energy. For 
example, if one generates 10 5 structures for each of the 
two segments and keeps only 10 3 of those, then one only 
need evaluate 10 3 x 10 3 = 10 6 full structures out of a 
possible 10 5 x 10 5 = 10 10 . A statistically correct seg- 
mentation strategy is currently being investigated by the 
authors for use in large peptides. 

Another strategy which may prove useful for larger sys- 
tems is to use the reference system method with multi- 
stage simulation. Multi-stage simulation requires the in- 
troduction of a hybrid potential energy parameterized by 
A, e.g., 

U x = XU phys + (1 - X)U Tei . (14) 

Thus, Uq = U IC { and 17% = U p \ lys . Simulations are per- 
formed using the hybrid potential energy U\ (and thus a 
hybrid forcefield, if using molecular dynamics) at inter- 
mediate A values between and 1. Conventional free en- 
ergy methods such as thermodynamic integration or free 
energy perturbation can then be used to obtain -Fphys- 

We also believe that including correlations, such as 
suggested by Eq. 10 and possibly other ways, may be 
useful. The inclusion of correlations should improve the 
overlap between the reference and physical ensembles — 
thereby reducing the amount of sampling required in the 
reference system, hence improving efficiency. This also 
will be explored in future work. (We also remind the 
reader that the final free energy value includes the full 
correlations in E/phy S , regardless of U IC {.) 

The method could prove useful in future protcin-ligand 
binding studies. In the simplest approach, one could 
freeze all degrees of freedom except for the ligand and 
side-chain degrees of freedom in the binding site. While 
the absolute free energy would be unphysical, the ap- 
proach could permit comparison of ligands or protein 
mutations with little or no conformational similarity. 

In principle, it is possible to extend the reference sys- 



tem method to include explicitly solvated biomolecules. 
However, as with all absolute free energy methods, the 
addition of the solvent degrees of freedom causes the free 
energy estimate to converge much more slowly than with- 
out explicit solvent. Thus, we feel the method described 
in this study will find use primarily in implicitly solvated 
biomolecules. 



V. CONCLUSIONS 

In conclusion, we have introduced and tested a simple 
method for calculating absolute free energies in molec- 
ular systems. The approach relies on the construction 
of an ensemble of reference structures (i.e., the reference 
system) that is designed to have high overlap with the 
physical system of interest. The method was first shown 
to reproduce exactly computable absolute free energies 
for simple systems, and then used to correctly predict 
the stability of leucine dipeptide conformations using all 
one-hundred fifteen degrees of freedom. 

Some strengths of the approach are that: (i) the refer- 
ence system is built to have good overlap with the system 
of interest by using internal coordinates and by using a 
single equilibrium ensemble from Monte Carlo or molec- 
ular dynamics; (ii) the absolute free energy estimate is 
guaranteed to converge to the correct value, whether or 
not the physical ensemble is complete and, in fact, it is 
possible to estimate the absolute free energy without the 
use of a physical ensemble; (iii) the method explicitly in- 
cludes all degrees of freedom employed in the simulation; 

(iv) the reference system need only be numerically com- 
putable, i.e., the exact analytic result is not needed; and 

(v) the method can be trivially extended to include the 
use of multi-stage simulation. The CPU cost of the ap- 
proach, beyond that for short trajectories of the physical 
system of interest, is one energy call for each reference 
structure, plus the less expensive cost of generating the 
reference ensemble. 

In the present "proof of principle" report, our method 
was used to study conformational equilibria; however we 
feel that the simplicity and flexibility of the method may 
find broad use in computational biophysics and biochem- 
istry for a wide variety of free energy problems. We have 
also described a segmentation strategy, currently being 
pursued, to use the approach in much larger systems. 
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